Enhancing tunnel crack detection with linear seam using mixed stride convolution and attention mechanism

Cracks in tunnel lining structures constitute a common and serious problem that jeopardizes the safety of traffic and the durability of the tunnel. The similarity between lining seams and cracks in terms of strength and morphological characteristics renders the detection of cracks in tunnel lining structures challenging. To address this issue, a new deep learning-based method for crack detection in tunnel lining structures is proposed. First, an improved attention mechanism is introduced for the morphological features of lining seams, which not only aggregates global spatial information but also features along two dimensions, height and width, to mine more long-distance feature information. Furthermore, a mixed strip convolution module leveraging four different directions of strip convolution is proposed. This module captures remote contextual information from various angles to avoid interference from background pixels. To evaluate the proposed approach, the two modules are integrated into a U-shaped network, and experiments are conducted on Tunnel200, a tunnel lining crack dataset, as well as the publicly available crack datasets Crack500 and DeepCrack. The results show that the approach outperforms existing methods and achieves superior performance on these datasets.

information, resulting in limitations in capturing global information. As a result, the resulting images of cracks are often plagued by noise and incompleteness.
Rapid advancements in deep learning for automatic feature extraction have profoundly influenced crack detection research. Current deep learning methods primarily utilize networks such as HED 9 and U-Net 10, which, despite their effectiveness, often struggle to differentiate between lining joints and cracks in tunnel structures due to their similar appearances. To address these challenges, several enhancements have been made to these networks. For instance, Liu et al. 11 enhanced the HED network by implementing a deeply supervised training strategy and integrating guided filters and conditional random fields to refine detection details. Yang et al. 12 incorporated a feature pyramid and hierarchical boosting module to enhance detection accuracy. Similarly, enhancements to the U-Net network include the work of Han et al. 13, who replaced the traditional skip-layer connections with a round-trip sampling block, and Zhou et al. 14, who introduced a mixed attention module and a multi-scale feature fusion strategy within the skip-layer connections to enhance performance in detecting cracks in tunnel linings. These developments represent a concerted effort to overcome the limitations of current deep learning approaches in distinguishing structural elements in complex environments.
However, in environments with lining structures, these methods often fail to effectively differentiate between lining seams and cracks, leading to frequent false detections. To address this, many researchers have proposed using attention mechanisms and strip convolutions to improve performance, specifically targeting objects with distinctive shapes. Attention mechanisms are typically divided into three types: channel, spatial, and self-attention. Channel attention, as utilized in SENet 15 and ECANet 16, prioritizes relevant channels, while spatial attention mechanisms, such as those in Non-Local neural networks 17, focus on specific image regions. Additionally, models such as the Convolutional Block Attention Module (CBAM) 18, Dual Attention 19, and the Multidimensional Collaborative Attention Module 20 combine these approaches, using sequential or parallel execution strategies to optimize feature integration. Strip convolution enables models to focus more on features in a specific direction of an image (such as horizontal or vertical), thus effectively complementing global features. Many models have implemented this method, including the Inception architecture 21, which suggests decomposing an N × N convolutional layer into two layers of 1 × N and N × 1 to improve computational efficiency. Strip convolution has also been utilized across various specific fields. Mei et al. 22 introduced a strip convolution module for road segmentation scenarios to capture long-distance dependencies and enhance segmentation performance. Zhou et al. 23 employed strip convolution to collect more detailed features of cracks, acting as a robust complement to features extracted by conventional convolutions. Yang et al. 24 developed a multi-scale feature convolution attention network, named MSFCA-Net, utilizing different sizes of strip convolutions to segment field crops and weeds. Liao and colleagues 25 harnessed strip convolution to improve feature extraction of rice seedling leaves, thereby providing strong support for the development of intelligent weeding technologies.
To address the limitations of existing crack detection methods, a new deep learning-based tunnel lining crack detection method based on attention mechanisms and strip convolution is proposed. As shown in Fig. 1, the morphological features of cracks and lining seams are very similar, making it easy to mistakenly detect lining seams as cracks. It was observed that lining seams mostly appear as horizontal or vertical lines, while cracks often appear as curves. Therefore, starting from the shape characteristics of lining seams and cracks, an improved attention module and a mixed strip convolution module are proposed. The improved attention module adopts a multi-branch structure that aggregates features along the height (H) and width (W) dimensions, which can effectively capture long-distance features in the horizontal and vertical directions to distinguish between cracks and lining seams. In addition, based on the morphological characteristics of cracks, mixed strip convolution is implemented to enhance feature extraction. This method builds upon the foundational horizontal and vertical strip convolutions by incorporating diagonal strip convolutions in both the left and right directions. Such an approach further captures the long-distance dependencies of cracks, facilitating the differentiation between cracks and lining seams. Our method incorporates several key modules to enhance the network's performance in recognizing cracks and lining seams. The overall network structure is depicted in Fig. 2.
First, an improved attention module is introduced that captures long-range dependencies between cracks and lining seams. Drawing on the attention placement in the DA-TransUNet 26 and Attention U-Net 27 methods, this module is positioned between the encoder and decoder of the U-shaped network to enhance the network's ability to recognize lining seam features. Furthermore, to effectively address the challenges posed by the large, narrow, and continuous distribution of cracks, a mixed strip convolution module is incorporated in the decoder. This module employs four strip convolutions, in the horizontal, vertical, left diagonal, and right diagonal directions, to capture remote contextual information and minimize interference from irrelevant regions. By integrating these proposed modules into the U-shaped structure, our method can accurately detect the cracked areas in tunnel lining structures and improve safety during tunnel operation.
The contributions to the field of tunnel lining crack detection include the following: (1) Based on the morphological characteristics of lining seams, an improved attention mechanism is proposed that effectively distinguishes between crack and lining seam features by aggregating features along two spatial dimensions (height and width) and complementing global spatial feature information. (2) Based on the morphological characteristics of cracks, a mixed strip convolution module is employed that captures remote contextual information in four distinct directions and mitigates interference from irrelevant regions. (3) A novel deep learning-based crack detection network is introduced that surpasses existing methods in performance on the Tunnel200 dataset and demonstrates strong performance on the publicly available DeepCrack and Crack500 datasets.
The structure of the remaining sections in this paper is as follows: "Introduction" presents a summary of related work on crack detection and attention mechanisms. "The proposed methods" provides a detailed description of the proposed network. "Experimental results and analysis" contains details about the dataset, evaluation metrics, implementation specifics, and a series of experiments aimed at evaluating the performance of the crack detection method in tunnel lining structures. Lastly, "Conclusion" concludes the paper and offers insights into future research directions.

Figure 2. The overall structure of the proposed network. The network can be divided into three principal components: the encoding phase, the skip connection segment, and the decoding phase. Within the skip connection segment, an enhanced attention mechanism is incorporated to bolster the network's feature extraction capabilities. The receptive field enhancement module, employing parallel dilated convolutions, is applied to the lower and red blocks of the network to augment the network's receptive field. In the decoding stage, a mixed strip convolution module is implemented to better capture the characteristics of both cracks and lining joints. Finally, the network's output is $D_1$.

Improved attention mechanism
In this section, an improved attention mechanism is proposed that captures both channel attention and spatial attention through a three-branch structure, as illustrated in Fig. 3. The initial branch utilizes channel pooling to efficiently distribute weights across distinct spatial locations, while the remaining two branches focus on the interplay between channels and the height (H) and width (W) dimensions of the input tensor, respectively. In conventional approaches, the potential interactions between channel attention and spatial attention often go unexplored. To rectify this limitation, a concept of cross-dimension interaction is introduced, drawing inspiration from the methodology employed in constructing spatial attention mechanisms. This approach captures the interplay between the spatial dimensions and the channel dimension of the input tensor, yielding a more cohesive and comprehensive model.
The proportion of pixels that display cracks in the entire image is relatively low, and a significant portion of these pixels form elongated structures. As a result, directly applying existing spatial attention mechanisms has limited efficacy in the task of crack detection. To tackle this challenge, an attention mechanism for feature aggregation has been devised that operates in both the horizontal and vertical directions. This approach enables the acquisition of more precise spatial feature information and accommodates the long-range dependencies inherent in cracks. The feature aggregation process pools the input feature $X$ using (H,1) or (1,W) pooling kernels sized to the image dimensions. Let $X_H$ and $X_W$ denote the feature maps obtained by pooling the input feature along the height and width dimensions, respectively. A 1 × 1 convolutional layer is then applied to transform the features, where $C_1$ and $C_2$ denote the 1 × 1 convolutional transformations, δ signifies the ReLU activation function, and σ denotes the sigmoid function; the result of this step is the output $Y_1$. The overall spatial attention branch first applies channel pooling, followed by convolution, batch normalization, and ReLU and sigmoid activations to obtain the corresponding weights. The final output of the module is denoted $Y$.
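The directional aggregation step can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: it assumes average pooling, works on a single-channel map stored as a nested list, omits the learned 1 × 1 convolutions and the ReLU, and multiplies the two sigmoid gates in the style of coordinate attention. The function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def pool_height(x):
    """(H, 1) average pooling: collapse the height axis, one value per column."""
    h = len(x)
    return [sum(x[i][j] for i in range(h)) / h for j in range(len(x[0]))]


def pool_width(x):
    """(1, W) average pooling: collapse the width axis, one value per row."""
    return [sum(row) / len(row) for row in x]


def directional_gate(x):
    """Reweight x with sigmoid gates built from the two pooled profiles.

    Assumption: learned transforms are left out and the two gates are
    simply multiplied; the paper's exact combination may differ.
    """
    col_gate = [sigmoid(v) for v in pool_height(x)]
    row_gate = [sigmoid(v) for v in pool_width(x)]
    return [
        [x[i][j] * row_gate[i] * col_gate[j] for j in range(len(x[0]))]
        for i in range(len(x))
    ]


# Toy 4x4 map with a horizontal, seam-like line in row 1.
feature = [[0.0] * 4, [1.0] * 4, [0.0] * 4, [0.0] * 4]
row_profile = pool_width(feature)   # peaks at the row that holds the line
col_profile = pool_height(feature)  # flat, since every column is crossed once
gated = directional_gate(feature)
```

In this toy case the row profile has a single sharp peak while the column profile is flat, which is the kind of axis-aligned signature that lets pooled descriptors respond strongly to straight seams.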

Mixed strip convolution
Conventional CNNs use square convolutional kernels to learn feature maps, which are suitable for most natural images with blocky shapes. However, crack images are characterized by their large span, long stripes, and continuous distribution. Square convolution fails to adequately capture the linear characteristics of cracks and includes extraneous information from adjacent pixels. Strip convolution, which uses a long kernel along one spatial direction to capture the long-range dependencies in the crack region, is more consistent with the morphological characteristics of cracks and lining seams 28. Furthermore, it captures local context along the other spatial direction and mitigates the influence of irrelevant regions on feature learning. Building upon prior research, the Mixed Strip Convolution Module (MSCM) is therefore incorporated into the crack detection process. As depicted in Fig. 4, MSCM captures long-range dependencies in crack information from four distinct directions through horizontal, vertical, left diagonal, and right diagonal strip convolutions. The input feature map is $X \in \mathbb{R}^{H \times W \times C}$, where H, W, and C stand for the height, width, and number of channels, respectively. To prepare the feature map for processing, a 1 × 1 convolutional layer is applied to adjust its dimensions. Subsequently, the adjusted feature map is fed into four parallel processing branches, each specializing in feature learning along a different direction. Finally, the feature maps obtained from the four branches are concatenated, upsampled, and convolved by a 1 × 1 layer to generate the final feature map output. Let $w \in \mathbb{R}^{2k+1}$ be a strip convolution filter of size $2k+1$, $D = (D_h, D_w)$ be the direction of the filter $w$, and $Z_D \in \mathbb{R}^{H \times W \times C'}$ denote the result of strip convolution. The strip convolution can be defined as follows:

$$Z_D[i, j] = (X * w)_D[i, j] = \sum_{l=-k}^{k} X[i + D_h l,\; j + D_w l] \cdot w[k - l] \tag{4}$$

where $X * w$ denotes the convolution operation. The direction vector of the strip convolution is represented by $D$, where the values (0,1), (1,0), (1,1), and (-1,1) correspond to the horizontal, vertical, left diagonal, and right diagonal convolutions, respectively. To ensure consistency with the 3 × 3 convolution kernel, k is set to 4 for the filter $w$, resulting in each strip convolution having nine parameters.
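A directional strip convolution over a single-channel map can be sketched as below. This is an illustrative sketch only: it assumes the directional sum takes the common form z[i][j] = Σ_l x[i + D_h·l][j + D_w·l] · w[k − l] for l from −k to k, uses zero padding at the borders (the padding scheme is an assumption), and uses k = 1 in the example for brevity, whereas the text sets k = 4. The names `strip_conv` and `DIRECTIONS` are hypothetical.

```python
def strip_conv(x, w, direction):
    """Apply a 1-D filter w of length 2k+1 along direction = (d_h, d_w).

    Positions that fall outside the map contribute zero (assumed padding).
    """
    k = (len(w) - 1) // 2
    d_h, d_w = direction
    height, width = len(x), len(x[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for l in range(-k, k + 1):
                r, c = i + d_h * l, j + d_w * l
                if 0 <= r < height and 0 <= c < width:
                    acc += x[r][c] * w[k - l]
            out[i][j] = acc
    return out


# Direction vectors as listed in the text.
DIRECTIONS = {
    "horizontal": (0, 1),
    "vertical": (1, 0),
    "left_diagonal": (1, 1),
    "right_diagonal": (-1, 1),
}

# Toy 5x5 map whose main diagonal is set, mimicking an oblique crack.
diag = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
box = [1.0, 1.0, 1.0]  # k = 1, unlearned box filter for illustration
center = {name: strip_conv(diag, box, d)[2][2] for name, d in DIRECTIONS.items()}
```

At the center pixel, only the branch aligned with the diagonal structure accumulates all three samples; the other three branches see a single non-zero pixel. This is the intuition behind running four directions in parallel.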

Receptive field enhancement module
The size of the receptive field has a significant impact on a network's ability to perceive target size, with smaller receptive fields being better suited to recognizing smaller targets and larger receptive fields being more adept at recognizing larger targets. The RFE module, which is based on ASPP 29, features four branches with dilation (hole) rates of 2, 4, 8, and 16, respectively. Within each branch, the 3 × 3 convolutions share the same dilation rate to achieve an expanded receptive field. To align with the network's channel dimension, the number of channels in the second convolution of each branch is set to 128. Finally, the outputs of the different branches are merged, and the final multi-scale feature map is generated using a 1 × 1 convolution.
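To make the effect of the dilation rates concrete, the spatial extent covered by a single dilated kernel can be computed with the standard relation k_eff = d · (k − 1) + 1. The short sketch below applies it to the four rates named above; the helper name is hypothetical, and it only covers one convolution, not the stacked receptive field of a whole branch.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one dilated convolution kernel."""
    return dilation * (kernel_size - 1) + 1


# Extent of a single 3x3 kernel at each dilation rate used by the RFE branches.
rfe_extents = {d: effective_kernel_size(3, d) for d in (2, 4, 8, 16)}
```

A 3 × 3 kernel with dilation 16 therefore spans 33 pixels per side while still using only nine weights, which is how the parallel branches cover several scales at a modest parameter cost.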

Loss function
The conventional binary cross-entropy (BCE) loss function is widely used in crack detection. Nonetheless, its application presents challenges, as the number of crack pixels is often significantly lower than that of the background pixels. Using the standard BCE loss during training may cause the model to predominantly emphasize the non-crack pixels, due to their prevalence in the dataset. Consequently, the model may mainly learn features associated with the majority class, potentially degrading its performance in crack detection. To address this issue, a weighted BCE loss function is employed. In this loss, $Y^+$ and $Y^-$ represent the crack and non-crack samples, respectively, and $w_0$ and $w_1$ denote the weights assigned to crack and non-crack pixels, respectively. The weight $w_0$ is defined in terms of $|Y^+|$ and $|Y^-|$, the total counts of crack and non-crack pixels in the entire training dataset, and $w_1$ is set to 1.
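A class-weighted BCE of this kind can be sketched as follows. The sketch rests on two assumptions that are not confirmed by the text: the crack weight is taken as the inverse class frequency ratio, w0 = |Y−| / |Y+|, and the loss is averaged over pixels. Only w1 = 1 comes from the paper. Function names are hypothetical.

```python
import math


def class_weights(labels):
    """Return (w0, w1) for crack and non-crack pixels.

    Assumption: w0 = |Y-| / |Y+| (inverse frequency ratio); w1 = 1 as in the text.
    Requires at least one crack pixel in `labels`.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos, 1.0


def weighted_bce(probs, labels, w0, w1, eps=1e-7):
    """Mean weighted BCE over pixels; probs are predicted crack probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        if y == 1:
            total -= w0 * math.log(p)
        else:
            total -= w1 * math.log(1.0 - p)
    return total / len(labels)


# Toy example: one crack pixel among four.
labels = [1, 0, 0, 0]
w0, w1 = class_weights(labels)
loss = weighted_bce([0.5, 0.5, 0.5, 0.5], labels, w0, w1)
```

With this weighting, the single crack pixel contributes as much to the loss as the three background pixels combined, which counteracts the imbalance described above.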

Experimental results and analysis
Both the proposed method and the compared methods were implemented using the PyTorch framework. In the training phase, the images in the dataset were standardized to a size of 448 × 448 pixels. The batch size was set to 4, and the learning rate was set to 1e−4. The Tunnel200 dataset underwent training for 100 epochs, while both the Crack500 and DeepCrack datasets were trained for 300 epochs each. During the decoder stage, upsampling was carried out using the bilinear interpolation method. Batch normalization and ReLU activation were applied in each convolutional layer during both the encoder and decoder stages. The optimizer used was adaptive moment estimation (Adam) with a weight decay of 1e−3. The experiments were conducted on an Ubuntu 16.04 system equipped with a 4-core Intel(R) Xeon(R) Silver CPU and a Tesla V100 32GB GPU.
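For reproduction purposes, the settings listed above can be gathered into a single configuration object. This is a minimal sketch; the class and field names are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TrainConfig:
    """Training settings as reported in the text (field names are illustrative)."""
    image_size: int = 448          # images standardized to 448 x 448
    batch_size: int = 4
    learning_rate: float = 1e-4
    weight_decay: float = 1e-3     # used with the Adam optimizer
    optimizer: str = "adam"
    upsampling: str = "bilinear"   # decoder upsampling mode
    epochs: dict = field(
        default_factory=lambda: {"Tunnel200": 100, "Crack500": 300, "DeepCrack": 300}
    )

    def epochs_for(self, dataset):
        """Number of training epochs reported for a given dataset."""
        return self.epochs[dataset]


cfg = TrainConfig()
```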

Datasets
In this paper, the efficacy of the proposed model in detecting cracks in tunnel lining structures is demonstrated by experimental results obtained from the Tunnel200 dataset. Furthermore, to show the effectiveness of the proposed approach, it is validated using the publicly accessible DeepCrack and Crack500 crack datasets. Here is a concise overview of these three datasets.
(1) Tunnel200 14: this dataset was captured by Zhou et al. using a cell phone on a real tunnel lining surface. The dataset presents significant difficulties in detecting cracks due to severe interference from tunnel light, illumination, and lining seams. The original image size in the dataset is 2048 × 1536, and to reduce computational effort, the authors uniformly cropped the images to 448 × 448. The data consists of only 200 images, of which we use 180 as the training set and 20 as the test set. Considering the limited size of this dataset, the 10-fold validation methodology outlined in the original paper is adopted, and the final result is derived by averaging the outcomes from these 10 folds.
(2) CRACK500 12: the CRACK500 dataset was collected by the Temple University team, who employed cell phones to annotate crack defects in intricate pavements. The resolution of the original images is 2000 × 1500, and a cropping method is used to divide the image into 16 non-overlapping regions, generating 3368 images of 448 × 448 pixels. In this paper, 3000 of these images are utilized for training and 368 are reserved for testing.
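The 10-fold protocol used for Tunnel200 (180 training and 20 test images per fold, results averaged over folds) can be sketched as follows. The sketch assumes contiguous, unshuffled folds; the text does not state how images were assigned to folds, and the function names are hypothetical.

```python
def ten_fold_splits(n_images=200, n_folds=10):
    """Yield (train_indices, test_indices) for each fold.

    Assumption: folds are contiguous index blocks without shuffling.
    """
    fold_size = n_images // n_folds
    indices = list(range(n_images))
    for f in range(n_folds):
        start, stop = f * fold_size, (f + 1) * fold_size
        yield indices[:start] + indices[stop:], indices[start:stop]


def average_over_folds(scores):
    """Final reported score: mean of the per-fold scores."""
    return sum(scores) / len(scores)


splits = list(ten_fold_splits())
```

Each image appears in exactly one test fold, so the averaged score uses every image for evaluation once.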

Comparison methods
Comparative experiments were conducted against a wide array of methods, which are briefly introduced below.
(1) HED 9: this method is based on the fully convolutional network (FCN) and incorporates a deep supervision strategy to enhance model performance. Many current crack detection models are built upon this network.
(2) U-Net 10: this method, commonly used in medical image segmentation, utilizes a U-shaped encoder-decoder architecture and skip connections for feature fusion. Many current crack detection models are based on this network.
(3) DeepLabV3+ 30: this method is a classic network for semantic segmentation that uses atrous (dilated) convolution to enlarge the receptive field and increase global information acquisition without increasing model parameters.
(4) DeepCrack 11: this method, which builds upon the HED network, incorporates guided filters and conditional random fields to enhance the final detection performance.
(5) DeepCrackT 31: an improvement on the U-Net network, this method fuses encoder and decoder features to improve crack detection results.
(6) FPHBN 12: this method incorporates feature pyramid and hierarchical boosting modules into the HED network, aiming to enhance feature propagation and improve network learning.
(7) HACNet 32: this method maintains the same spatial resolution throughout the network and introduces hybrid atrous convolutions to keep a large receptive field.
(8) SwinTransformer 33: this approach has gained popularity in the field of visual Transformers and has demonstrated promising outcomes in segmentation and detection tasks. In this study, the abbreviation "SwinT" is employed to refer to this method.
(9) TransUnet 34: this method incorporates a Transformer structure into the U-shaped network, enhancing the model's capacity for contextual modeling. It has demonstrated promising results in the domain of medical image segmentation.
(10) ECDFFNet 23: this method is among the better-performing recent methods in the field of crack detection. It proposes Enhanced Convolution and Dynamic Feature Fusion strategies to improve the final detection performance.
(11) TCDNet 14: this approach is a leading method for crack detection in tunnel linings. It introduces a mixed attention module and a multi-scale feature fusion strategy within the skip-layer connections to enhance the final detection performance.

Evaluation metrics
In this study, precision (P), recall (R), the P-R curve, and the F1-score ($F_1$) are employed as evaluation metrics for assessing the performance of these models. Precision, recall, and F1-score are commonly used evaluation metrics for classification models. Precision quantifies the ratio of true positive predictions to all positive predictions made by the model, while recall quantifies the ratio of true positive predictions to all actual positive instances in the dataset. The F1-score is the harmonic mean of precision and recall, offering a comprehensive performance metric that considers both aspects:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \times P \times R}{P + R}$$

where TP represents true positives, FP denotes false positives, and FN stands for false negatives. Furthermore, in this paper, crack detection is treated as a binary semantic segmentation task, aiming to differentiate the crack region from the background. To assess the models' performance in this task, three semantic segmentation metrics are utilized: pixel accuracy (PA), mean pixel accuracy (MPA), and mean intersection over union (MIoU):

$$PA = \frac{\sum_{i=1}^{K} p_{ii}}{\sum_{i=1}^{K} \sum_{j=1}^{K} p_{ij}}, \qquad MPA = \frac{1}{K} \sum_{i=1}^{K} \frac{p_{ii}}{\sum_{j=1}^{K} p_{ij}}, \qquad MIoU = \frac{1}{K} \sum_{i=1}^{K} \frac{p_{ii}}{\sum_{j=1}^{K} p_{ij} + \sum_{j=1}^{K} p_{ji} - p_{ii}}$$

where K represents the number of classes (in this case, K = 2 for crack and non-crack), and $p_{ij}$ signifies the count of pixels of class i predicted to belong to class j. In addition, the processing speed of the models is quantified in frames per second (FPS).
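For the binary case considered here, all six accuracy metrics follow from the four pixel counts of the confusion matrix. The sketch below uses the standard definitions of these metrics; the function name and the example counts are illustrative and do not come from the paper's experiments.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Compute P, R, F1, PA, MPA, and MIoU for crack (positive) vs background."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    pa = (tp + tn) / (tp + fp + fn + tn)
    # Per-class pixel accuracy: correct pixels over ground-truth pixels of the class.
    mpa = (tp / (tp + fn) + tn / (tn + fp)) / 2
    # Per-class IoU: intersection over union of prediction and ground truth.
    miou = (tp / (tp + fp + fn) + tn / (tn + fp + fn)) / 2
    return {"P": precision, "R": recall, "F1": f1, "PA": pa, "MPA": mpa, "MIoU": miou}


# Illustrative counts for a 1000-pixel image with 100 true crack pixels.
metrics = segmentation_metrics(tp=80, fp=20, fn=20, tn=880)
```

In this example pixel accuracy is 0.96 even though precision and recall are only 0.8, which shows why PA alone is a weak indicator when crack pixels are rare and why F1 and MIoU are reported alongside it.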

Experimental results
(1) Results on Tunnel200. As illustrated in Table 1, our proposed method demonstrates superior performance compared to current crack detection methods on the Tunnel200 dataset, achieving $F_1$ and MIoU scores of 0.729 and 0.792, respectively. Notably, the TCDNet network, tailored specifically for tunnel crack detection, outperforms the other compared methods, attaining $F_1$ and MIoU values of 0.704 and 0.763, respectively. HACNet achieves the highest accuracy. HACNet maintains the same spatial resolution throughout the network architecture, which is particularly important for detecting targets like cracks that have elongated and subtle features. This design minimizes the potential loss of important spatial details during downsampling, while introducing hybrid atrous convolutions to maintain a larger receptive field, thereby enhancing the precision of crack detection. However, this approach does not consider the features of seams, which can lead to misidentifying seam cuts as cracks. In contrast, classical crack detection networks like DeepCrack, DeepCrackT, and FPHBN exhibit comparatively lower performance on this dataset, with $F_1$ values of 0.552, 0.536, and 0.451, respectively. Interestingly, the classical U-Net surpasses these crack detection networks on this dataset, achieving an $F_1$ score of 0.648. The performance of recent Transformer-based network structures, including SwinT and TransUnet, was also assessed; they yielded less satisfactory results due to the dataset's limited size, with $F_1$ scores of only 0.259 and 0.220, respectively. In the Tunnel200 dataset, the proportion of crack pixels is smaller, so there are fewer crack samples in absolute terms, which makes it harder for the model to improve recall. The model might miss smaller or less obvious cracks, resulting in a lower recall rate. Visual comparisons of detection results for different models are presented in Fig. 5.
Conventional crack detection methods struggle to effectively differentiate between lining seams and crack features, resulting in false positives and missed detections. In contrast, our proposed method demonstrates enhanced accuracy in detecting cracked areas. The Precision-Recall curve in Fig. 6 further illustrates that our proposed method resides in the upper right corner, signifying superior performance compared to other approaches.
(2) Ablation studies. To validate the effectiveness of the proposed modules, ablation experiments were conducted to compare the performance of different modules. Specifically, the U-Net network was used as the Baseline (B), and the proposed improved attention mechanism (IAM), mixed strip convolution (MSC), and receptive field enhancement module (RFE) were individually embedded to assess their contributions. As shown in Table 2, the Baseline model had the lowest $F_1$ and MIoU scores of 0.648 and 0.721, respectively. However, the addition of the proposed modules effectively improved the Baseline network's performance, with MSC demonstrating the most significant improvement. Specifically, the $F_1$ and MIoU scores improved to 0.684 and 0.754, respectively, indicating that the MSC module can effectively distinguish between crack and lining seam features. Additionally, the proposed IAM module also exhibited significant enhancements in the detection accuracy of the Baseline network.
To further explore the combined effect of the proposed modules, experiments were conducted with different pairwise combinations, and it was found that the modules mutually enhance each other's performance. Specifically, the RFE+MSC combination showed the most significant improvement, with $F_1$ and MIoU scores of 0.720 and 0.785, respectively. Finally, by embedding all three proposed modules into the Baseline network, the best experimental results were achieved, with $F_1$ and MIoU scores improving to 0.729 and 0.792, respectively.
(3) Results on CRACK500. Figure 7 demonstrates that our proposed method outperforms other methods, as it resides closer to the upper right corner of the P-R curves. To validate the practicality of the proposed model, quantitative tests were conducted on the Crack500 dataset. Table 3 presents the performance results on the Crack500 test set, where our proposed method achieves leading levels in several evaluation metrics, including $F_1$ and MIoU, with values of 0.754 and 0.802, respectively. The HACNet model exhibits a high recall on the CRACK500 dataset, which can be attributed to its architecture maintaining the same spatial resolution throughout. Our method also surpasses the latest TCDNet in terms of accuracy with comparable speed, improving the $F_1$ value by 0.9 percentage points and the MIoU value by 1 percentage point. When compared with other classical crack detection networks such as DeepCrack, FPHBN, and ECDFFNet, our proposed method exhibits substantially enhanced detection accuracy. Furthermore, the performance of SwinT and TransUnet, based on Transformer network structures, improves with the larger data volume of the Crack500 dataset. Therefore, the conducted experiments demonstrate that our proposed method can effectively detect crack defects.
(4) Results on DeepCrack. As demonstrated in Fig. 8, our proposed method outperforms the compared methods, as it resides closer to the upper right corner of the P-R curves. To further validate its superiority, quantitative tests were conducted on the DeepCrack dataset. As shown in Table 3, our proposed method achieves state-of-the-art results in several evaluation metrics, including $F_1$ and MIoU, with values of 0.890 and 0.898, respectively. In comparison with the latest TCDNet, our method achieves accuracy improvements with comparable speed, improving both the $F_1$ and MIoU values by 0.7 percentage points. Furthermore, our proposed method significantly enhances detection accuracy when compared to other classical crack detection networks, including DeepCrack, FPHBN, and ECDFFNet. These results demonstrate that our proposed method can effectively detect crack defects.

Conclusion
In this paper, an approach for detecting cracks in tunnel linings is proposed. First, an enhanced attention module that more effectively captures long-range, aggregation-dependent information is introduced. This enhancement enables the model to distinguish between characteristics of cracks and lining seams with greater accuracy. Additionally, mixed strip convolution is integrated into the decoding stage to improve the model's capacity to capture distant contextual information in four directions: horizontal, vertical, left diagonal, and right diagonal. The effectiveness of the proposed method was assessed on the Tunnel200 dataset, demonstrating its accuracy in detecting cracks in tunnel linings, a critical aspect for ensuring safe tunnel operations. Furthermore, the approach was validated on the Crack500 and DeepCrack pavement crack datasets to highlight its robustness.
In future work, further enhancements to the Transformer structure are planned to better capture contextual information related to the structural characteristics of lining seams, thereby improving the model's performance. Additionally, the development of a multimodal task system that integrates information from various modalities, including video, image, and language, aims to enhance the early detection of tunnel surface defects.

Figure 1. An illustrative diagram depicting crack detection in a tunnel lining structure. The top row displays the original image, while the bottom row showcases the detection results achieved by the method proposed in this paper.

Figure 3. The structural diagram of the proposed improved attention mechanism module.

Figure 4. The structural diagram of the proposed mixed strip convolution module.

and $w_1$ as $w_1 = 1$, with $|Y^+|$ and $|Y^-|$ representing the total count of crack and non-crack pixels in the entire training dataset.

(B) and individually embedded the proposed improved attention mechanism (IAM), mixed strip convolution (MSC), and receptive field enhancement module (RFE) to assess their contributions. As shown in Table
(3) DeepCrack 11: this dataset is a well-established publicly available pavement crack detection dataset, extensively employed for validating the efficacy of algorithms in the realm of crack detection. In this study, the DeepCrack dataset is augmented by integrating the CFD dataset, resulting in a combined dataset comprising 758 images. The dataset is partitioned into 521 images for the training set and 237 for the testing set.

Table 1. The evaluation metrics of competing methods on the Tunnel200 dataset. Significant values are in bold.

Table 2. Ablation analysis for the proposed architecture on the Tunnel200 dataset. Significant values are in bold.

Table 3. The evaluation metrics of competing methods on the Crack500 and DeepCrack datasets. Significant values are in bold.